

PARs: Predicate-based Association Rules for Efficient and Accurate Model-Agnostic Anomaly Explanation

Feng, Cheng

arXiv.org Artificial Intelligence

Our user study shows that the anomaly explanation form of PARs is better understood and favoured by regular anomaly detection system users compared with existing model-agnostic anomaly explanation options. In our experiments, we demonstrate that it is significantly more efficient to find PARs than anchors (Ribeiro, Singh, and Guestrin 2018), another rule-based explanation, for identified anomaly instances. Moreover, PARs are also far more accurate than anchors for anomaly explanation, meaning that they have considerably higher precision and recall when applied as anomaly detection rules on unseen data other than the anomaly instance on which they were originally derived for explanation. Additionally, we show that PARs can also achieve higher accuracy on abnormal feature identification compared with many state-of-the-art model-agnostic explanation methods including LIME (Ribeiro, Singh, and Guestrin 2016), SHAP (Lundberg and Lee 2017), and COIN.

Anomaly detection, which aims to identify data instances that do not conform to the expected behavior, is a classic machine learning task with numerous applications in various domains including fraud detection, intrusion detection, predictive maintenance, etc. Over the past decades, numerous methods have been proposed to tackle this challenging problem. Examples include one-class classification-based (Manevitz and Yousef 2001; Ruff et al. 2018), nearest neighbor-based (Breunig et al. 2000), clustering-based (Jiang and An 2008), isolation-based (Liu, Ting, and Zhou 2012; Hariri, Kind, and Brunner 2019), density-based (Liu, Tan, and Zhou 2022; Feng and Tian 2021) and deep anomaly detection models based on autoencoders (Zhou and Paffenroth 2017; Zong et al. 2018) and generative adversarial networks (Zenati et al. 2018; Han, Chen, and Liu 2021), to name a few.
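The precision/recall evaluation of an explanation rule described here can be sketched as follows: a found rule is re-applied as a detection rule on held-out data and scored against the true labels. This is a minimal, hypothetical illustration; the rule, field names, and data are invented for the example and are not from the paper.

```python
# Sketch: score a rule-based explanation as an anomaly detection
# rule on unseen data via precision and recall.
def rule_precision_recall(rule, instances, labels):
    """rule: callable returning True when an instance is flagged anomalous;
    labels: 1 for true anomaly, 0 for normal."""
    tp = sum(1 for x, y in zip(instances, labels) if rule(x) and y == 1)
    fp = sum(1 for x, y in zip(instances, labels) if rule(x) and y == 0)
    fn = sum(1 for x, y in zip(instances, labels) if not rule(x) and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy rule on toy data: flag transactions whose amount exceeds 100.
rule = lambda x: x["amount"] > 100
data = [{"amount": 150}, {"amount": 50}, {"amount": 120}, {"amount": 30}]
labels = [1, 0, 0, 0]
print(rule_precision_recall(rule, data, labels))  # (0.5, 1.0)
```

A rule that generalizes well keeps both numbers high on instances other than the one it was derived from.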


Analysing the Impact of Removing Infrequent Words on Topic Quality in LDA Models

Bystrov, Victor, Naboka-Krell, Viktoriia, Staszewska-Bystrova, Anna, Winker, Peter

arXiv.org Artificial Intelligence

The use of topic modelling techniques, especially Latent Dirichlet Allocation (LDA) introduced by Blei et al. (2003), is growing fast, and the methods find application in a broad variety of domains. In text-as-data applications, LDA enables the analysis of large collections of text in an unsupervised manner by uncovering latent structures behind the data. Given this increasing use of LDA as a standard tool for empirical analysis, interest in the details of the method, and in particular in the parameter settings for its implementation, is also rising. Since the introduction of LDA, several of its methodological components have been studied in detail, for example the choice of the number of topics (Cao et al., 2009; Mimno et al., 2011; Lewis and Grossetti, 2022; Bystrov et al., 2022a), hyper-parameter settings (Wallach et al., 2009), and model design.
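The preprocessing step studied here, removing infrequent words before fitting LDA, could be sketched as a simple corpus-frequency filter. The `min_count` parameter below is an assumed illustrative threshold, not a value from the paper.

```python
from collections import Counter

# Sketch: drop words below a minimum corpus frequency before fitting LDA.
def prune_infrequent(tokenized_docs, min_count=5):
    freq = Counter(w for doc in tokenized_docs for w in doc)
    vocab = {w for w, c in freq.items() if c >= min_count}
    return [[w for w in doc if w in vocab] for doc in tokenized_docs]

docs = [["topic", "model", "rare1"], ["topic", "model"], ["topic", "rare2"]]
print(prune_infrequent(docs, min_count=2))
# [['topic', 'model'], ['topic', 'model'], ['topic']]
```

In practice a library routine such as gensim's `Dictionary.filter_extremes` performs the same pruning; the point of the paper is how this choice interacts with topic quality.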


Establishing Central Sensitization Inventory Cut-off Values in patients with Chronic Low Back Pain by Unsupervised Machine Learning

Zheng, Xiaoping, Lamoth, Claudine JC, Timmerman, Hans, Otten, Ebert, Reneman, Michiel F

arXiv.org Artificial Intelligence

Human Assumed Central Sensitization (HACS) is involved in the development and maintenance of chronic low back pain (CLBP). The Central Sensitization Inventory (CSI) was developed to evaluate the presence of HACS, with a cut-off value of 40/100 based on patients with chronic pain. However, various factors, including the pain condition (e.g., CLBP) and gender, may influence this cut-off value. For a chronic pain condition such as CLBP, unsupervised clustering approaches can take these factors into consideration and automatically learn HACS-related patterns. Therefore, this study aimed to determine the cut-off values for a Dutch-speaking population with CLBP, for the total group and stratified by gender, based on unsupervised machine learning. Questionnaire data covering pain, physical, and psychological aspects were collected from patients with CLBP and age-matched pain-free adults (referred to as healthy controls, HC). Four clustering approaches were applied to identify HACS-related clusters based on the questionnaire data and gender. Clustering performance was assessed using internal and external indicators. Subsequently, receiver operating characteristic (ROC) analysis was conducted on the best clustering results to determine the optimal cut-off values. The study included 151 subjects: 63 HCs and 88 patients with CLBP. Hierarchical clustering yielded the best results, identifying three clusters: a healthy group, a CLBP group with a low HACS level, and a CLBP group with a high HACS level. Based on the low HACS level group (HC and CLBP with a low HACS level) and the high HACS level group, the cut-off value was 35 for the overall group, 34 for females, and 35 for males. The findings suggest that the optimal cut-off value for CLBP is 35. The gender-related cut-off values should be interpreted with caution due to the unbalanced gender distribution in the sample.
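A common way to pick a cut-off from an ROC analysis, once clusters provide the reference labels, is to maximize Youden's J statistic (sensitivity + specificity - 1). The sketch below is illustrative only; the scores and labels are invented and the paper does not specify that this exact criterion was used.

```python
# Sketch: choose a questionnaire cut-off from an ROC curve by
# maximizing Youden's J = sensitivity + specificity - 1.
def youden_cutoff(scores, labels):
    """labels: 1 = high-HACS cluster, 0 = low-HACS cluster."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg  # sensitivity - (1 - specificity)
        if j > best_j:
            best_j, best_t = j, t
    return best_t

scores = [20, 30, 33, 36, 40, 45]  # toy CSI totals
labels = [0, 0, 0, 1, 1, 1]        # toy cluster assignments
print(youden_cutoff(scores, labels))  # 36
```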


Meta learning with language models: Challenges and opportunities in the classification of imbalanced text

Vassilev, Apostol, Jin, Honglan, Hasan, Munawar

arXiv.org Artificial Intelligence

Out-of-policy speech (OOPS) has permeated social media, with serious consequences for both individuals and society. Although it comprises a small fraction of the content generated daily on social media, sifting through the data to quickly identify and eliminate the toxic content is difficult. The scale of this problem has long passed the threshold that requires automated detection. Yet it remains a challenging problem for machine learning (ML) due to the way OOPS manifests itself in datasets: context-dependent, nuanced, non-colloquial language that may even be syntactically incorrect. Because the OOPS content of a dataset is usually only a small fraction of the overall size, there is a high imbalance between OOPS and in-policy text. Related to this, there are not many high-quality labeled datasets with consistent definitions of OOPS and in-policy content. The difficulties are exacerbated further by significant differences between the distributions of the datasets the model has been trained on and the data it sees during deployment. Faced with all of these challenges, ML models applied to natural language processing (NLP) tasks quickly reach a performance ceiling that limits their usefulness for sensitive tasks such as OOPS detection.


Unsupervised sequence-to-sequence learning for automatic signal quality assessment in multi-channel electrical impedance-based hemodynamic monitoring

Hyun, Chang Min, Kim, Tae-Geun, Lee, Kyounghun

arXiv.org Artificial Intelligence

This study proposes an unsupervised sequence-to-sequence learning approach that automatically assesses the motion-induced reliability degradation of the cardiac volume signal (CVS) in multi-channel electrical impedance-based hemodynamic monitoring. The proposed method attempts to tackle shortcomings of existing learning-based assessment approaches, such as the requirement of manual annotation of motion influence and the lack of explicit mechanisms for capturing motion-induced abnormalities under contextual variations in the CVS over time. Using long short-term memory and variational auto-encoder structures, an encoder-decoder model is trained not only to self-reproduce an input sequence of the CVS but also to extrapolate the future in a parallel fashion. By doing so, the model can capture contextual knowledge lying in a temporal CVS sequence while being regularized to learn a general relationship over the entire time series. A motion-influenced, low-quality CVS is detected based on the residual between the input sequence and its neural representation, with a cut-off value determined from the two-sigma rule of thumb over the training set. Our experimental observations validated two claims: (i) even without labels, assessment performance is achievable at a level competitive with the supervised setting, and (ii) the contextual information across a time series of CVS is advantageous for effectively capturing motion-induced unrealistic distortions in signal amplitude and morphology. We also investigated its capability as a pseudo-labeling tool to minimize manual annotation by preemptively providing strong candidates for motion-induced anomalies. Empirical evidence has shown that machine-guided annotation can reduce inevitable human errors during manual assessment while minimizing cumbersome and time-consuming processes.
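The two-sigma thresholding step described here can be sketched compactly: compute the mean and standard deviation of the reconstruction residuals over the training set, and flag any sequence whose residual exceeds mean + 2*sigma. The residual values below are invented for illustration.

```python
import statistics

# Sketch of the two-sigma rule: flag a CVS segment as low-quality when
# its reconstruction residual exceeds mean + 2*std of training residuals.
def two_sigma_threshold(train_residuals):
    mu = statistics.mean(train_residuals)
    sigma = statistics.stdev(train_residuals)
    return mu + 2 * sigma

def is_low_quality(residual, threshold):
    return residual > threshold

train = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10]  # toy training residuals
t = two_sigma_threshold(train)
print(is_low_quality(0.35, t))  # True: far outside the training residuals
print(is_low_quality(0.11, t))  # False
```

Under a roughly Gaussian residual distribution, about 95% of clean training data falls below this threshold, which is what makes the rule a reasonable default cut-off.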


Learning Invariant Rules from Data for Interpretable Anomaly Detection

Feng, Cheng, Hu, Pingge

arXiv.org Artificial Intelligence

In the research area of anomaly detection, novel and promising methods are frequently developed. However, most existing studies focus exclusively on the detection task and ignore the interpretability of the underlying models as well as their detection results. Nevertheless, anomaly interpretation, which aims to explain why specific data instances are identified as anomalies, is an equally important task in many real-world applications. In this work, we propose a novel framework which synergizes several machine learning and data mining techniques to automatically learn invariant rules that are consistently satisfied in a given dataset. The learned invariant rules can provide explicit explanations of anomaly detection results in the inference phase and thus are extremely useful for subsequent decision-making regarding reported anomalies. Furthermore, our empirical evaluation shows that the proposed method can also achieve comparable or even better performance in terms of AUC and partial AUC on public benchmark datasets across various application domains compared with state-of-the-art anomaly detection models.


The best metric to measure accuracy of classification models CleverTap

#artificialintelligence

As an analyst, if you are looking for a metric to measure and maximize the overall accuracy of a classification model, the Matthews correlation coefficient (MCC) seems to be the best bet, since it is not only easily interpretable but also robust to changes in the prediction goal.
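For reference, MCC is computed directly from the binary confusion matrix; the sketch below shows the standard formula with toy counts.

```python
import math

# Matthews correlation coefficient from a binary confusion matrix.
# Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect).
def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(tp=50, tn=50, fp=0, fn=0))    # 1.0  (perfect classifier)
print(mcc(tp=25, tn=25, fp=25, fn=25))  # 0.0  (chance-level)
```

Unlike plain accuracy, MCC stays near zero for a chance-level classifier even on heavily imbalanced data, which is why it is often preferred for model comparison.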